This is an analysis of the responses to Kaggle’s 2017 user survey. In total, 16,716 people answered enough of the survey to be analyzed. From this survey and the following analysis, we can learn how to build and manage a data science team.
All data was collected by Kaggle in their 2017 user survey. The survey was available to users from August 7 - August 25, 2017.
Inspired by this kernel
# For data cleaning and plotting
library(tidyverse)
library(ggplot2)
library(plotly)
# For text analysis and word clouds
library(tm)
library(SnowballC)
library(wordcloud)
# For the alluvial and chord diagrams used below
library(alluvial)
library(circlize)

Based on the way this data is structured, I want to keep the first row as column headers.
# Import multiple choice data
Survey <- read.csv('../input/multipleChoiceResponses.csv', stringsAsFactors = TRUE, header = TRUE)
# Import freeform responses
rawFFData <- read.csv('../input/freeformResponses.csv', stringsAsFactors = FALSE, header = TRUE)
# Import the actual questions asked
schema <- read.csv('../input/schema.csv', stringsAsFactors = FALSE, header = TRUE)

Last, I need to import the currency conversion rates for use later.
conversionRates <- read.csv('../input/conversionRates.csv', header = TRUE)

We create an id column; it will be used later.
Survey = Survey %>%
mutate(id = 1:nrow(Survey))

We cut Age into bins to make the analysis easier.
Survey$Age <- as.numeric(as.character(Survey$Age))
ggplot(Survey,aes(Age)) +
geom_histogram(bins = 100, color = "blue")

breaks_Age = c(min(Survey$Age, na.rm = TRUE), 20, 25, 30, 45, max(Survey$Age, na.rm = TRUE))
Survey$Age = cut(Survey$Age, breaks_Age, include.lowest = TRUE)
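To sanity-check the binning, here is a minimal, self-contained sketch using the same breaks on a handful of made-up ages (not the survey data):

```r
# Toy ages standing in for Survey$Age (hypothetical values)
ages <- c(18, 19, 22, 24, 27, 31, 40, 50, 61)

# Same binning scheme as above: numeric breaks, lowest value included
breaks_age <- c(min(ages), 20, 25, 30, 45, max(ages))
age_groups <- cut(ages, breaks_age, include.lowest = TRUE)

# Count respondents per age group
print(table(age_groups))
```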
## Where can you find data scientists? School and career change
df = Survey %>%
select(CurrentJobTitleSelect) %>%
group_by(CurrentJobTitleSelect) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
filter(CurrentJobTitleSelect != "")
custom_text = sprintf("Job: %s <br> freq: %d", df$CurrentJobTitleSelect, df$count)
p = ggplot(df,aes(x = reorder(CurrentJobTitleSelect,-count),y = count,fill = reorder(CurrentJobTitleSelect,-count),text = custom_text))+
geom_bar(stat = "identity")
ggplotly(p, tooltip = "text")
foo = Survey %>%
select(Age,CurrentJobTitleSelect,DataScienceIdentitySelect) %>%
filter(CurrentJobTitleSelect != "") %>%
filter(DataScienceIdentitySelect != "") %>%
filter(CurrentJobTitleSelect %in% df$CurrentJobTitleSelect[1:10]) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
group_by(Age,CurrentJobTitleSelect,DataScienceIdentitySelect) %>%
summarise(count = n()) %>%
droplevels()
alluvial(foo[,1:3], freq=foo$count,
col = ifelse(foo$DataScienceIdentitySelect == "Yes", "orange", "grey"),
border = ifelse(foo$DataScienceIdentitySelect == "Yes", "orange", "grey"),
cex = 0.7
)

DS = Survey$id[which(Survey$DataScienceIdentitySelect == "Yes")]
DS = c(DS, Survey$id[which(Survey$CurrentJobTitleSelect == "Data Scientist")])
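Note that concatenating the two `which()` results can duplicate ids for respondents who match both conditions; `%in%` is unaffected by duplicates, but `union()` (base R, shown here on toy ids) keeps the vector duplicate-free:

```r
# Toy id vectors (hypothetical): self-identified DS and job-title DS overlap on id 3
self_identified <- c(1, 3, 5)
titled_ds <- c(3, 7)

# c() keeps the duplicate 3; union() does not
combined <- c(self_identified, titled_ds)
ds_ids <- union(self_identified, titled_ds)

print(combined)  # 1 3 5 3 7
print(ds_ids)    # 1 3 5 7
```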
foo = Survey %>%
mutate(IsDS = id %in% DS) %>%
select(FormalEducation,MajorSelect,CurrentJobTitleSelect,IsDS) %>%
filter(CurrentJobTitleSelect != "") %>%
filter(FormalEducation != "") %>%
filter(MajorSelect != "") %>%
#filter(CurrentJobTitleSelect %in% df$CurrentJobTitleSelect[1:10]) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
group_by(FormalEducation,MajorSelect,CurrentJobTitleSelect,IsDS) %>%
summarise(count = n()) %>%
filter(count > 50) %>%
droplevels()
alluvial(foo[,1:4], freq=foo$count,
col = ifelse(foo$IsDS, "grey", "orange"),
border = ifelse(foo$IsDS, "grey", "orange"),
#hide = foo$count < 50,
cex = 0.7
)

This graph can help you see where to find data scientists (DS). For example:

- Most bachelor’s-degree holders are not data scientists.
- A good software developer with a doctoral degree can work as a data scientist; without the degree, usually not.
recommendations <- Survey %>%
# Remove any non-entries for either question
filter(WorkToolsSelect != "") %>%
filter(LanguageRecommendationSelect != "") %>%
# Select only the columns for the language recommendations and language use
select(WorkToolsSelect, LanguageRecommendationSelect) %>%
# Split the language usage column at the comma
mutate(WorkToolsSelect = strsplit(as.character(WorkToolsSelect), '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl = TRUE)) %>%
# Split answers are now nested, need to unnest them
unnest(WorkToolsSelect) %>%
# Group by language used and then by recommendation
group_by(WorkToolsSelect, LanguageRecommendationSelect) %>%
# Rename the columns
rename(Used = WorkToolsSelect, Recommended = LanguageRecommendationSelect) %>%
# Count the number of responses for each language use/recommendation combination
summarise(count = n()) %>%
filter(count >300)
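The `strsplit()` pattern above is worth unpacking: the PCRE control verbs `(*SKIP)(*FAIL)` make the regex skip commas that appear inside parentheses (e.g. “Amazon Web Services (AWS,EC2)”), so only top-level commas split the answer. A self-contained check on one made-up response:

```r
# A made-up multi-tool answer with a comma inside parentheses
answer <- "Amazon Web Services (AWS,EC2),Python,R"

# Same pattern as the pipeline: commas inside (...) are skipped, not split on
tools <- strsplit(answer, '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl = TRUE)[[1]]

print(tools)  # "Amazon Web Services (AWS,EC2)" "Python" "R"
```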
# Display the results
recommendations$Used = as.factor(recommendations$Used)
recommendations = recommendations %>%
droplevels()

circos.clear()
chordDiagram(recommendations, annotationTrack = "grid", preAllocateTracks = 1)
circos.trackPlotRegion(track.index = 1, panel.fun = function(x, y) {
xlim = get.cell.meta.data("xlim")
ylim = get.cell.meta.data("ylim")
sector.name = get.cell.meta.data("sector.index")
circos.text(mean(xlim), ylim[1] + .1, sector.name, facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5), cex=.6)
# circos.axis(h = "top", labels.cex = 0.5, major.tick.percentage = 0.2, sector.index = sector.name, track.index = 2)
}, bg.border = NA)

As you can see, most respondents use Python and R, and when asked to recommend a language, most choose Python. The chord diagram also shows that people who work mainly with tools other than R and Python still tend to recommend Python or R.
We can look at which technologies were considered the most important on the job. Here, I plotted popularity against usefulness to see which technologies are used the most in the real world.

This graph shows that Python is the most useful and popular skill, followed by statistics, big data, R, SQL, and visualization.
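The popularity-vs-usefulness plot itself is not reproduced above; here is a minimal ggplot2 sketch with made-up scores (hypothetical numbers, only to illustrate the layout):

```r
library(ggplot2)

# Hypothetical popularity/usefulness scores, not the survey's actual numbers
skills <- data.frame(
  skill      = c("Python", "Stats", "BigData", "R", "SQL", "Visualizations"),
  popularity = c(0.90, 0.75, 0.70, 0.68, 0.65, 0.60),
  usefulness = c(0.92, 0.80, 0.72, 0.70, 0.66, 0.62)
)

# Scatter plot with one labeled point per skill
p <- ggplot(skills, aes(x = popularity, y = usefulness, label = skill)) +
  geom_point(color = "blue") +
  geom_text(vjust = -0.8, size = 3) +
  labs(x = "Popularity", y = "Perceived usefulness")
```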
Now that you know which skills are important for your data scientists, you need to know HOW they can learn them. You can point them to some learning platforms; let’s see which are considered the most useful:

So, you can suggest Kaggle, online courses, Stack Overflow, and hands-on projects.
Next step: predicting salary with a regression model.
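Once compensation is converted to a common currency with `conversionRates`, such a model could be fit with base R’s `lm()`. A minimal sketch on simulated data (all numbers made up, only to show the shape of the approach):

```r
set.seed(42)

# Simulated respondents: salary grows with years of experience plus noise
n <- 200
experience <- runif(n, 0, 20)
salary <- 40000 + 3000 * experience + rnorm(n, sd = 5000)

# Fit a simple linear regression and inspect the slope
fit <- lm(salary ~ experience)
slope <- coef(fit)[["experience"]]

# Predict salary for a hypothetical respondent with 10 years of experience
pred <- predict(fit, newdata = data.frame(experience = 10))
```

With real survey data, predictors such as country, education, and job title would be added to the formula.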